83 research outputs found

    An analysis of the Sargasso Sea resource and the consequences for database composition

    Get PDF
    Background: The environmental sequencing of the Sargasso Sea has introduced a huge new resource of genomic information. Unlike the protein sequences held in the current searchable databases, the Sargasso Sea sequences originate from a single marine environment and have been sequenced from species that are not easily obtainable by laboratory cultivation. The resource also contains very many fragments of whole protein sequences, a side effect of the shotgun sequencing method.These sequences form a significant addendum to the current searchable databases but also present us with some intrinsic difficulties. While it is important to know whether it is possible to assign function to these sequences with the current methods and whether they will increase our capacity to explore sequence space, it is also interesting to know how current bioinformatics techniques will deal with the new sequences in the resource.Results: The Sargasso Sea sequences seem to introduce a bias that decreases the potential of current methods to propose structure and function for new proteins. In particular the high proportion of sequence fragments in the resource seems to result in poor quality multiple alignments.Conclusion: These observations suggest that the new sequences should be used with care, especially if the information is to be used in large scale analyses. On a positive note, the results may just spark improvements in computational and experimental methods to take into account the fragments generated by environmental sequencing techniques

    Coding potential of the products of alternative splicing in human

    Get PDF
    Background: Analysis of the human genome has revealed that as much as an order of magnitude more of the genomic sequence is transcribed than accounted for by the predicted and characterized genes. A number of these transcripts are alternatively spliced forms of known protein coding genes; however, it is becoming clear that many of them do not necessarily correspond to a functional protein. Results: In this study we analyze alternative splicing isoforms of human gene products that are unambiguously identified by mass spectrometry and compare their properties with those of isoforms of the same genes for which no peptide was found in publicly available mass spectrometry datasets. We analyze them in detail for the presence of uninterrupted functional domains, active sites as well as the plausibility of their predicted structure. We report how well each of these strategies and their combination can correctly identify translated isoforms and derive a lower limit for their specificity, that is, their ability to correctly identify non-translated products. Conclusions: The most effective strategy for correctly identifying translated products relies on the conservation of active sites, but it can only be applied to a small fraction of isoforms, while a reasonably high coverage, sensitivity and specificity can be achieved by analyzing the presence of non-truncated functional domains. Combining the latter with an assessment of the plausibility of the modeled structure of the isoform increases both coverage and specificity with a moderate cost in terms of sensitivity

    Alternative splicing and the evolution of phenotypic novelty

    Get PDF
    Alternative splicing, a mechanism of post-transcriptional RNA processing whereby a single gene can encode multiple distinct transcripts, has been proposed to underlie morphological innovations in multicellular organisms. Genes with developmental functions are enriched for alternative splicing events, suggestive of a contribution of alternative splicing to developmental programmes. The role of alternative splicing as a source of transcript diversification has previously been compared to that of gene duplication, with the relationship between the two extensively explored. Alternative splicing is reduced following gene duplication with the retention of duplicate copies higher for genes which were alternatively spliced prior to duplication. Furthermore, and unlike the case for overall gene number, the proportion of alternatively spliced genes has also increased in line with the evolutionary diversification of cell types, suggesting alternative splicing may contribute to the complexity of developmental programmes. Together these observations suggest a prominent role for alternative splicing as a source of functional innovation. However, it is unknown whether the proliferation of alternative splicing events indeed reflects a functional expansion of the transcriptome or instead results from weaker selection acting on larger species, which tend to have a higher number of cell types and lower population sizes.This article is part of the themed issue 'Evo-devo in the genomics era, and the origins of morphological diversity'

    High-Throughput Proteomics Detection of Novel Splice Isoforms in Human Platelets

    Get PDF
    Alternative splicing (AS) is an intrinsic regulatory mechanism of all metazoans. Recent findings suggest that 100% of multiexonic human genes give rise to splice isoforms. AS can be specific to tissue type, environment or developmentally regulated. Splice variants have also been implicated in various diseases including cancer. Detection of these variants will enhance our understanding of the complexity of the human genome and provide disease-specific and prognostic biomarkers. We adopted a proteomics approach to identify exon skip events - the most common form of AS. We constructed a database harboring the peptide sequences derived from all hypothetical exon skip junctions in the human genome. Searching tandem mass spectrometry (MS/MS) data against the database allows the detection of exon skip events, directly at the protein level. Here we describe the application of this approach to human platelets, including the mRNA-based verification of novel splice isoforms of ITGA2, NPEPPS and FH. This methodology is applicable to all new or existing MS/MS datasets

    The fitness cost of mis-splicing is the main determinant of alternative splicing patterns

    Get PDF
    Background Most eukaryotic genes are subject to alternative splicing (AS), which may contribute to the production of protein variants or to the regulation of gene expression via nonsense-mediated messenger RNA (mRNA) decay (NMD). However, a fraction of splice variants might correspond to spurious transcripts and the question of the relative proportion of splicing errors to functional splice variants remains highly debated. Results We propose a test to quantify the fraction of AS events corresponding to errors. This test is based on the fact that the fitness cost of splicing errors increases with the number of introns in a gene and with expression level. We analyzed the transcriptome of the intron-rich eukaryote Paramecium tetraurelia. We show that in both normal and in NMD-deficient cells, AS rates strongly decrease with increasing expression level and with increasing number of introns. This relationship is observed for AS events that are detectable by NMD as well as for those that are not, which invalidates the hypothesis of a link with the regulation of gene expression. Our results show that in genes with a median expression level, 92–98% of observed splice variants correspond to errors. We observed the same patterns in human transcriptomes and we further show that AS rates correlate with the fitness cost of splicing errors. Conclusions These observations indicate that genes under weaker selective pressure accumulate more maladaptive substitutions and are more prone to splicing errors. Thus, to a large extent, patterns of gene expression variants simply reflect the balance between selection, mutation, and drift

    Assessment of orthologous splicing isoforms in human and mouse orthologous genes

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Recent discoveries have highlighted the fact that alternative splicing and alternative transcripts are the rule, rather than the exception, in metazoan genes. Since multiple transcript and protein variants expressed by the same gene are, by definition, structurally distinct and need not to be functionally equivalent, the concept of gene orthology should be extended to the transcript level in order to describe evolutionary relationships between structurally similar transcript variants. In other words, the identification of true orthology relationships between gene products now should progress beyond primary sequence and "splicing orthology", consisting in ancestrally shared exon-intron structures, is required to define orthologous isoforms at transcript level.</p> <p>Results</p> <p>As a starting step in this direction, in this work we performed a large scale human- mouse gene comparison with a twofold goal: first, to assess if and to which extent traditional gene annotations such as RefSeq capture genuine splicing orthology; second, to provide a more detailed annotation and quantification of true human-mouse orthologous transcripts defined as transcripts of orthologous genes exhibiting the same splicing patterns.</p> <p>Conclusions</p> <p>We observed an identical exon/intron structure for 32% of human and mouse orthologous genes. This figure increases to 87% using less stringent criteria for gene structure similarity, thus implying that for about 13% of the human RefSeq annotated genes (and about 25% of the corresponding transcripts) we could not identify any mouse transcript showing sufficient similarity to be confidently assigned as a splicing ortholog. Our data suggest that current gene and transcript data may still be rather incomplete - with several splicing variants still unknown. The observation that alternative splicing produces large numbers of alternative transcripts and proteins, some of them conserved across species and others truly species-specific, suggests that, still maintaining the conventional definition of gene orthology, a new concept of "splicing orthology" can be defined at transcript level.</p

    Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

    Get PDF
    BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations. CONCLUSION/SIGNIFICANCE: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project

    Measuring Global Credibility with Application to Local Sequence Alignment

    Get PDF
    Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1βˆ’Ξ±)%, 0≀α≀1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1βˆ’Ξ±)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments

    Allelic Gene Structure Variations in Anopheles gambiae Mosquitoes

    Get PDF
    Allelic gene structure variations and alternative splicing are responsible for transcript structure variations. More than 75% of human genes have structural isoforms of transcripts, but to date few studies have been conducted to verify the alternative splicing systematically.The present study used expressed sequence tags (ESTs) and EST tagged SNP patterns to examine the transcript structure variations resulting from allelic gene structure variations in the major human malaria vector, Anopheles gambiae. About 80% of 236,004 available A. gambiae ESTs were successfully aligned to A. gambiae reference genomes. More than 2,340 transcript structure variation events were detected. Because the current A. gambiae annotation is incomplete, we re-annotated the A. gambiae genome with an A. gambiae-specific gene model so that the effect of variations on gene coding could be better evaluated. A total of 15,962 genes were predicted. Among them, 3,873 were novel genes and 12,089 were previously identified genes. The gene completion rate improved from 60% to 84%. Based on EST support, 82.5% of gene structures were predicted correctly. In light of the new annotation, we found that approximately 78% of transcript structure variations were located within the coding sequence (CDS) regions, and >65% of variations in the CDS regions have the same open-reading-frame. The association between transcript structure isoforms and SNPs indicated that more than 28% of transcript structure variation events were contributed by different gene alleles in A. gambiae.We successfully expanded the A. gambiae genome annotation. We predicted and analyzed transcript structure variations in A. gambiae and found that allelic gene structure variation plays a major role in transcript diversity in this important human malaria vector

    Entropy Measures Quantify Global Splicing Disorders in Cancer

    Get PDF
    Most mammalian genes are able to express several splice variants in a phenomenon known as alternative splicing. Serious alterations of alternative splicing occur in cancer tissues, leading to expression of multiple aberrant splice forms. Most studies of alternative splicing defects have focused on the identification of cancer-specific splice variants as potential therapeutic targets. Here, we examine instead the bulk of non-specific transcript isoforms and analyze their level of disorder using a measure of uncertainty called Shannon's entropy. We compare isoform expression entropy in normal and cancer tissues from the same anatomical site for different classes of transcript variations: alternative splicing, polyadenylation, and transcription initiation. Whereas alternative initiation and polyadenylation show no significant gain or loss of entropy between normal and cancer tissues, alternative splicing shows highly significant entropy gains for 13 of the 27 cancers studied. This entropy gain is characterized by a flattening in the expression profile of normal isoforms and is correlated to the level of estimated cellular proliferation in the cancer tissue. Interestingly, the genes that present the highest entropy gain are enriched in splicing factors. We provide here the first quantitative estimate of splicing disruption in cancer. The expression of normal splice variants is widely and significantly disrupted in at least half of the cancers studied. We postulate that such splicing disorders may develop in part from splicing alteration in key splice factors, which in turn significantly impact multiple target genes
    • …
    corecore